Lexical feature selection
The main question is this: how can we identify terms that are over-represented in a given set of documents?
Methods
Log odds ratio informative Dirichlet prior
Use this when you want to contrast two corpora (e.g. Democrats vs. Republicans). It needs a (big) background corpus to supply the prior. Seems to work really well in many cases.
An interesting application to restaurant menus: http://uncommonculture.org/ojs/index.php/fm/article/view/4944/3863
Code: https://gist.github.com/yy/a2fff314073c4806fd5b
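A minimal sketch of the scoring step, written independently of the gist linked above and not checked against it. It assumes the usual formulation: each word's count in the background corpus is used as a Dirichlet pseudo-count, the smoothed log odds of the word are compared between the two corpora, and the difference is divided by its approximate standard deviation to give a z-score. The function name `log_odds_dirichlet`, the `min_alpha` floor for words missing from the background corpus, and the toy counts are all assumptions made for illustration.

```python
import math
from collections import Counter


def log_odds_dirichlet(counts_a, counts_b, prior, min_alpha=0.01):
    """Return {word: z-score}; positive means over-represented in corpus A."""
    # Prior pseudo-counts come from the background corpus. Words missing from
    # it get a small floor (min_alpha, an assumption) so no log(0) occurs.
    vocab = set(counts_a) | set(counts_b) | set(prior)
    alpha = {w: max(prior.get(w, 0), min_alpha) for w in vocab}
    alpha_0 = sum(alpha.values())
    n_a = sum(counts_a.values())
    n_b = sum(counts_b.values())

    z_scores = {}
    for w in set(counts_a) | set(counts_b):
        # Smoothed counts: observed count plus prior pseudo-count.
        y_a = counts_a.get(w, 0) + alpha[w]
        y_b = counts_b.get(w, 0) + alpha[w]
        # Difference of smoothed log odds between the two corpora.
        delta = (math.log(y_a / (n_a + alpha_0 - y_a))
                 - math.log(y_b / (n_b + alpha_0 - y_b)))
        # Approximate variance of that difference.
        variance = 1.0 / y_a + 1.0 / y_b
        z_scores[w] = delta / math.sqrt(variance)
    return z_scores


# Toy counts, invented purely for illustration.
corpus_a = Counter({"tax": 10, "the": 50, "health": 1})
corpus_b = Counter({"tax": 1, "the": 50, "health": 10})
background = Counter({"the": 1000, "tax": 20, "health": 20})

scores = log_odds_dirichlet(corpus_a, corpus_b, background)
for word, z in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:>8} {z:+.2f}")
```

Words with large positive scores are over-represented in the first corpus, and words with large negative scores in the second. A frequent word used equally often in both (like "the" in the toy data) lands near zero, which is the main reason to prefer this over a raw frequency ratio.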